Accelerated focused crawling through online relevance feedback

Authors: Chakrabarti Soumen; Mallela Subramanyam; Punera Kunal
Publication date: 01/01/2002

The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page with respect to an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
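The idea of a crawl frontier whose URL priorities are fine-tuned by relevance estimates can be illustrated with a small priority queue. This is only a sketch under stated assumptions: the class name `CrawlFrontier`, its methods, and the numeric scores are hypothetical, and the relevance predictor itself (which the abstract says is learned from the HREF source page) is not modeled here.

```python
import heapq
import itertools


class CrawlFrontier:
    """Hypothetical frontier: unvisited URLs ordered by estimated relevance."""

    def __init__(self):
        self._heap = []                     # entries: (-score, tie_breaker, url)
        self._best = {}                     # url -> highest score seen so far
        self._done = set()                  # urls already handed out for fetching
        self._counter = itertools.count()   # tie-breaker keeps insertion order

    def add(self, url, score):
        """Add a URL, or raise its priority if a better estimate arrives."""
        if url in self._done:
            return
        if score > self._best.get(url, float("-inf")):
            self._best[url] = score
            heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Return the (url, score) pair with the highest current estimate."""
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            # Skip stale heap entries that were superseded by a higher score.
            if url not in self._done and self._best[url] == -neg_score:
                self._done.add(url)
                return url, -neg_score
        raise IndexError("frontier is empty")


frontier = CrawlFrontier()
frontier.add("http://example.org/a", 0.2)   # scores are made-up estimates
frontier.add("http://example.org/b", 0.9)
frontier.add("http://example.org/a", 0.95)  # revised estimate for the same URL
first = frontier.pop()
second = frontier.pop()
```

Re-pushing an entry and discarding stale ones on `pop` avoids an explicit decrease-key operation, which Python's `heapq` does not provide.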

Memex: a browsing assistant for collaborative archiving and mining of surf trails

Authors: Chakrabarti Soumen; Srivastava Sandeep; Subramanyam Mallela; Tiwari Mitul
Publication date: 01/01/2000

Keyword indices, topic directories and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored in a passive and isolated manner. All this goes against Vannevar Bush’s dream of the Memex: An enhanced supplement to personal and community memory. We propose to demonstrate the beginnings of a ‘Memex’ for the Web: A browsing assistant for individuals and groups with focused interests. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways at the individual and community levels. Memex constructs a topic directory customized to the community, mapping their interests naturally to nodes in this directory. This lets the user recall topic-based browsing contexts by asking questions like “What trails was I following when I was last surfing about classical music?” and “What are some popular pages in or near my community’s recent trail graph related to music?”

Enhanced word clustering for hierarchical text classification

Authors: Inderjit S. Dhillon; Subramanyam Mallela
Publication venue: ACM Press

CiteSeerX

Enhanced Word Clustering for Hierarchical Text Classification

Authors: Inderjit S. Dhillon; Rahul Kumar; Subramanyam Mallela
Publication date: 01/01/2002

In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower numbers of features [2, 28]. However, the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm achieves higher classification accuracy, especially at lower numbers of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from Dmoz Open Directory.
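The quantity named in the abstract, the Jensen-Shannon divergence of a group of distributions, can be computed as below. This sketch uses the standard weighted (generalized) definition, the weighted sum of KL divergences to the weighted mean distribution; how the paper chooses the weights and the per-word distributions is not stated in the abstract, so the function names and the toy inputs are assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p[i] == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(dists, weights):
    """Generalized Jensen-Shannon divergence of several distributions.

    dists   : list of probability vectors of equal length
    weights : non-negative weights summing to 1, one per distribution
    """
    size = len(dists[0])
    # Weighted mean distribution of the group.
    mean = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(size)]
    # Weighted sum of divergences from each member to the mean.
    return sum(w * kl_divergence(d, mean) for w, d in zip(weights, dists) if w > 0)


# Toy class distributions for two hypothetical words.
same = js_divergence([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
disjoint = js_divergence([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

A cluster whose members have identical distributions has zero divergence, while fully disjoint members give the maximum for two equally weighted distributions (one bit), which is why a low within-cluster value indicates a coherent cluster.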

CiteSeerX

Crossref

Enhanced word clustering for hierarchical text classification

Authors: Inderjit S. Dhillon; Rahul Kumar; Subramanyam Mallela
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2004

Crossref

Using Memex to archive and mine community Web browsing experience

Authors: Chakrabarti Soumen; Srivastava Sandeep; Subramanyam Mallela; Tiwari Mitul
Publication venue: Elsevier BV
Publication date: 01/01/2000

Keyword indices, topic directories, and link-based rankings are used to search and structure the rapidly growing Web today. Surprisingly little use is made of years of browsing experience of millions of people. Indeed, this information is routinely discarded by browsers. Even deliberate bookmarks are stored passively, in browser-dependent formats; this separates them from the dominant world of HTML hypermedia, even if their owners were willing to share them. All this goes against Vannevar Bush's dream of the Memex: an enhanced supplement to personal and community memory. We present the beginnings of a Memex for the Web. Memex blurs the artificial distinction between browsing history and deliberate bookmarks. The resulting glut of data is analyzed in a number of ways. It is indexed not only by keywords but also according to the user's view of topics; this lets the user recall topic-based browsing contexts by asking questions like 'What trails was I following when I was last surfing about classical music?' and 'What are some popular pages related to my recent trail regarding cycling?' Memex is a browser assistant that performs these functions. We envisage that Memex will be shared by a community of surfers with overlapping interests; in that context, the meaning and ramifications of topical trails may be decided by not one but many surfers. We present a novel formulation of the community taxonomy synthesis problem, algorithms, and experimental results. We also recommend uniform APIs which will help manage advanced interactions with the browser.

Dspace at IIT Bombay

Information-Theoretic Co-Clustering

Authors: Dharmendra S. Modha; Inderjit S. Dhillon; Subramanyam Mallela
Publication venue: ACM Press
Publication date: 01/01/2003

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory -- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
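The objective described in the abstract can be made concrete by treating a count table as an empirical joint distribution, merging rows and columns according to cluster labels, and comparing the mutual information before and after. The sketch below only evaluates a given co-clustering; it does not search for the optimal one, and the function names, the toy table, and the label vectors are assumptions rather than anything taken from the paper.

```python
import math


def mutual_information(table):
    """Mutual information (bits) of the joint distribution implied by counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:
                # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in counts.
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


def co_cluster_table(table, row_labels, col_labels, k, l):
    """Sum counts into a k-by-l table of row clusters and column clusters."""
    merged = [[0] * l for _ in range(k)]
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            merged[row_labels[i]][col_labels[j]] += n
    return merged


# Toy 4x4 co-occurrence table with two obvious blocks.
counts = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
original_mi = mutual_information(counts)
good = mutual_information(co_cluster_table(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2))
bad = mutual_information(co_cluster_table(counts, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2))
```

In this toy table the block-aligned labels keep all of the original mutual information, while labels that mix the two row blocks reduce it to zero, so the first co-clustering is the better one under the stated criterion.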

CiteSeerX

Crossref